ABSTRACT

In this paper we describe our efforts to bring sci- 
entific data into the digital library. This has re- 
quired extension of the standard WWW, and also 
the extension of metadata standards far beyond the 
Dublin Core. Our system demonstrates this tech- 
nology for real scientific data from astronomy. 
INTRODUCTION

In the last few years we have seen the evolution of 
the Internet and WWW from a loose connection of 
information sources toward information rich digital 
libraries. In our view, a digital
library is an organized way to locate, access, and 
analyze digital artifacts of many kinds, federating 
many data services to create a virtual information 
space. 

The evolving digital library may play a key role for 
scientists by providing a unified environment for 
information discovery and access. In particular, the 
digital library can go beyond the traditional library, 
and provide direct, immediate location and access 
to both literature and data. 

At NCSA, the Emerge project has been constructing
the basic infrastructure required for interoperable
searching and for analysis of many kinds of data
from many heterogeneous data services distributed
about the network. [18] In our current work, we are
implementing a prototype to demonstrate the effec- 
tiveness of this technology for searching and access- 
ing astronomy data. This prototype illustrates how 
our flexible architectures can be applied to a col- 
lection of existing systems, to create an enhanced 
environment for information discovery. [21] This
work has been a collaboration of astronomers and
computer scientists at NCSA, NASA, the University
of Ulster, and elsewhere. The NCSA Astronomy
Digital Image Library (ADIL) has been the key
testbed for demonstrating the technology.

In this paper, we describe our model and prototype 
implementation for interoperable search and analysis
as applied to scientific digital libraries. Our
model places constraints on standards necessary for 
meaningful interoperability. In particular, we will 
show that the Dublin Core alone is insufficient to 
support the metadata associated with complex sci- 
entific data. However, appropriate standards can 
facilitate richer forms of research tuned to a dis- 
tributed scientific environment. 

Digital Libraries for Science 

For scientists, the digital library is particularly im- 
portant because of the importance of digital data 
and analysis, and the need for timely and rich information
exchange. Today, scientific discovery and
communication routinely create and use many kinds
of significant digital components:

• raw data 

• analyzed data 

• imagery 

• analysis environments 

• simulations 



• notes, letters, and reports 

• published articles 

These digital artifacts are complex and inter-related; 
for example, a digitally published article "points" 
to the data, instrumentation, and software which 
are described and interpreted by the text. Simi- 
larly, the digital representations of simulations and 
analyzed data such as images are most useful and 
valid in the context of the published documentation 
and scientific reports. An archive of scientific data 
will contain pointers from the data to published ar- 
ticles which explain and validate them; and scien- 
tific articles contain pointers to the data which they 
report. Similarly, theoretical results in the form of 
computational models are intended to be correlated 
with relevant observational and experimental data. 

There are many repositories of scientific data already
on line, and each new scientific project almost
inevitably produces significantly larger amounts
of digital data. [32] Through the use of the World
Wide Web and URLs, scientific information is al- 
ready becoming a rich web of connected digital in- 
formation. However, it remains a significant and 
lasting challenge for humans to exploit this rich- 
ness, to discover, access, and understand the knowl- 
edge that may reside or be created from digital re- 
sources. We believe that these archives should be 
an integral part of the digital library of the future, 
bringing together all types of scientific information 
resources in a single environment. 

The Emerge project at NCSA is developing prac- 
tical infrastructure for this new type of digital li- 
brary. In this vision, a student or researcher could 
"go to the library" to ask a scientific research ques- 
tion. For example, a researcher could seek to in- 
quire about the climate in Illinois in recent years. 
Even today, the library would provide pointers to 
published literature about weather, vegetation, wild- 
life, and so on, much of which is available on-line. 
The results should also provide pointers to rele- 
vant climate data, satellite imagery, computational 
models, and resources such as email archives. In 
most cases, these resources already exist and are 
available on the Web, but locating and accessing 
this diverse set of materials would be difficult with- 
out the organizing and facilitating role of a new 
kind of research library. This kind of digital library
will not only make routine scientific information
finding more efficient, it will also enable cross-discipline
and synergistic discovery, since the investigator will
likely be presented with information from many unexpected
sources.

A Case Study: Astronomy and Space Science Data 

In recent decades, we have experienced a golden age 
for the exploration of the universe. New ground 
and space based instruments and powerful com- 
puting systems have produced an explosion of as- 
tronomy and space science data. This explosion 
has driven the development of data archives, digi- 
tal libraries, and other network-based services that 
make it easier to access research-quality informa- 
tion. The success of such services has created envi- 
ronments within which one can gather knowledge 
from diverse sources to address new scientific ques- 
tions. 

The Astronomy Digital Image Library (ADIL) Testbed 

The NCSA Astronomy Digital Image Library 
(ADIL) has been the key testbed for demonstrating 
the technology. The ADIL was developed with support
from NASA and the National Science Foundation
to address some of the challenges of distributing
scientific data over the network. [24, 25] Its
specific mission is to collect fully processed astronomical
images in FITS format (a standard astronomical
image format [5]) and make them available
to the research community and the interested
public via the World Wide Web. 

The ADIL allows users to search, browse, and down- 
load astronomical images. As we will discuss below, 
this can be a non-trivial process when the images 
are not in the usual GIF or JPEG formats. 
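
To illustrate why such images need special handling, the following is a minimal sketch of reading a FITS primary header from raw bytes. It relies only on the basic FITS layout (80-character ASCII header cards packed into 2880-byte blocks, terminated by an END card). The demo header values and the helper names `parse_fits_header` and `make_card` are assumptions for illustration, and the sketch ignores many details of the standard, such as continued strings and escaped quotes.

```python
# Minimal sketch: read the primary header of a FITS file from raw bytes.
# FITS headers are sequences of 80-character ASCII "cards" packed into
# 2880-byte blocks; the header ends with a card whose keyword is END.

CARD = 80
BLOCK = 2880


def parse_fits_header(raw: bytes) -> dict:
    """Return {keyword: value-string} for simple 'KEY = value / comment' cards."""
    header = {}
    for start in range(0, len(raw), CARD):
        card = raw[start:start + CARD].decode("ascii", errors="replace")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        # Value cards carry "= " in columns 9-10; other cards
        # (COMMENT, HISTORY, blank) are skipped in this sketch.
        if card[8:10] != "= ":
            continue
        value = card[10:]
        if value.lstrip().startswith("'"):
            # String value: keep the text between the quotes
            # (escaped '' pairs are not handled here).
            value = value.lstrip()[1:].split("'", 1)[0].rstrip()
        else:
            value = value.split("/", 1)[0].strip()
        header[keyword] = value
    return header


def make_card(text: str) -> bytes:
    """Helper for the demo: pad one card to 80 characters."""
    return text.ljust(CARD).encode("ascii")


# Hypothetical header for a 512 x 512 image, padded to one 2880-byte block.
demo = b"".join(make_card(c) for c in [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                  -32 / 32-bit floating point",
    "NAXIS   =                    2",
    "NAXIS1  =                  512",
    "NAXIS2  =                  512",
    "OBJECT  = 'M31     '           / target name",
    "END",
])
demo = demo.ljust(BLOCK, b" ")

header = parse_fits_header(demo)
print(header["OBJECT"], header["NAXIS1"], header["NAXIS2"])
```

Even this small amount of parsing is something a standard Web browser does not do, which is why a FITS-aware service or client is needed before the pixel data can be rendered.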

The ADIL is more than a tool for astronomers look- 
ing for images to augment their research. It is also 
a means for authors who wish to share their im- 
ages with the community. While many of the Li- 
brary's images come from observatories, the core of 
the collection comes from individual authors. The 
ADIL provides a way to upload the images to the
Library, along with any supporting data, where they
can be processed and made available to the Library
users.

Authors deposit images into the Library in the form 
of collections we refer to as "projects". Normally,
an author would make a deposit at the end of some
scientific study when the resulting publication is
going to press; all the fully processed images associated
with that paper would make up the project.
In this way, the ADIL is part of the new paradigm
for scientific publishing. [24, 25]



EXTENDING THE WWW MODEL: A "CONVERSATION WITH
THE DATA"

In the conventional WWW model, which is biased 
toward small, text-oriented documents, a data loca- 
tion service usually returns a set of URLs pointing 
to documents which the user must visit (i.e., download
to the client) to view and analyze. For scientific
data, "search-and-download" is not a practical
model because the objects are typically not "docu- 
ments" , but rather large, complex objects (datasets) 
stored in formats not supported by standard browsers 
(such as FITS [5] or HDF [19]). In earlier work,
we described the need for a "conversation with the
data" which extends the standard Web model. [13]
The basic scenario is:

1. search to locate candidate data objects

2. browse and select the objects

3. download selected data for further analysis

Scientific archives have adapted to the Web by integrating
a browsing stage into the information discovery
process. In so-called server-side
browsing, the data provider presents a preview of a
dataset which might include a GIF or JPEG rendering
of the data and a display of some subset of
the associated metadata, all packaged as an HTML
document. The ADIL is a good example of such a
data service.

This model can work very well when interacting
with a single data provider and a set of datasets
that isn't too large. However, this model becomes
quite laborious for the user when trying to interact
with more than a few heterogeneous data
providers because:

1. a single question must be entered differently
into each of the providers' custom interfaces

2. each HTML-formatted query response and associated
browsing documents must be visited for
visual interpretation in a series of individual, stateless
requests, e.g., a list of links to URLs.

3. browsing of the data items is limited to what
is provided by each data provider's interface.

Also, some kinds of user interaction are difficult to
implement with server-side browsing. For instance,
drawing a bounding box or dynamically fiddling
with color maps is difficult to implement well on a
server.

To address these issues we extend the "search-browse-and-download"
model by adding:

1. stateful communication with a data provider,

2. support for standard query profiles,

3. support for a standard record format appropriate
for scientific data.



The result is a more fluid, automated, and efficient
interaction with multiple data providers. Users can 
interact with data from one provider while queries 
to other providers are being processed. The brows- 
ing can occur with different levels of detail. Per- 
haps the most powerful feature is that clients can 
take greater control of the browsing by plugging in 
specialized visualizers for quick plotting or manip- 
ulating of the results. 
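
The interaction pattern described above, in which results from one provider can be used while other providers are still searching, can be sketched as follows. This is only an illustration under stated assumptions: the three providers are hypothetical in-memory stand-ins with simulated delays, and the function names are invented here rather than taken from the Emerge software.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


# Hypothetical stand-ins for remote data providers; a real client would
# issue network requests here. Names and delays are illustrative only.
def make_provider(name, delay, records):
    def search(query):
        time.sleep(delay)  # simulate network and search latency
        return [r for r in records if query.lower() in r.lower()]
    search.provider_name = name
    return search


providers = [
    make_provider("archive-a", 0.05, ["M31 radio map", "M87 jet image"]),
    make_provider("archive-b", 0.01, ["M31 optical mosaic"]),
    make_provider("archive-c", 0.03, ["Orion nebula spectrum"]),
]


def federated_search(query):
    """Send one query to every provider and yield results as they arrive."""
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        pending = {pool.submit(p, query): p.provider_name for p in providers}
        for future in as_completed(pending):
            yield pending[future], future.result()


results = {}
for name, hits in federated_search("m31"):
    # The user could begin browsing these hits while the
    # remaining providers are still processing the query.
    results[name] = hits
    print(f"{name}: {len(hits)} record(s)")
```

Because the loop receives each provider's answer as soon as it completes, a slow archive does not block the display of results from faster ones.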

INFRASTRUCTURE FOR INFORMATION DISCOVERY FOR 
SCIENTIFIC RESEARCH 

Information discovery is increasingly the most crit- 
ical component of scientific research. As scientists 
work to solve problems, they need multi-modal ac- 
cess to geographically-distributed collections of large 
and highly structured data sets. Discovering which 
data sets are potentially relevant to a particular 
problem involves more or less elaborate character- 
izations of the data in terms of domain-specific at- 
tributes. Furthermore, examining candidate data 
sets to locate the most relevant ones involves highly 
specialized interactions with the data. 

Yet this diversity must not come at the cost of interoperability.
For science to progress, it is crucial
that scientists be able to locate information
from many different scientific domains when attempting
to solve a problem in their own domain.
Cross-disciplinary researchers should not be burdened
with a different set of information discovery
software tools for each discipline they work in,
especially where those tools perform essentially
the same functions. Also, data
services should provide data not only to individual 
end users but also to other services which add value 
to them. 

Even within a single discipline, it may be neces- 
sary to query many data repositories in order to 
locate all the data relevant to a scientific question. 
For instance, the NASA Space Science Data System
Technical Working Group reports a real scenario
based on the investigation of sulfur (S2) on
comets. This investigation turned out to require
data from multiple sources, including several spacecraft,
ground based telescopes, the Hubble Space
Telescope, and published (and unpublished) scientific
literature. The data was retrieved from many
different sources, in widely different formats
(see Appendix 2 of the Working Group report).

A distributed information discovery infrastructure 
should be built which emphasizes standard search 
protocols, file formats, and general-purpose tools. It
should be designed in such a way that profiles, for- 
mats, and browsers specific to a particular domain 
can be easily plugged in to the infrastructure and 
shared between data providers and consumers. 

In a sense, the WWW already provides a semblance 
of such an infrastructure. HTTP supports a variety 
of file formats and forms-based CGI services can be 
used to implement search tools which return views 
of information with additional forms controls for 
manipulating the view. However this mode of using 
the WWW does little to advance a standard query 
syntax or means of defining metadata schemas and 
profiles. Also, it fails to separate the user interface 
for information retrieval (the HTML forms) from 
the delivery of information itself (the metadata in 
the page). Search results returned as part of an 
HTML page are not standardized and cannot
easily be compared to similar results from a different
service.



NCSA Emerge: Practical Infrastructure For Information 
Discovery 

The NCSA Emerge Project is addressing these issues,
with the goal of designing infrastructure to
create unified information discovery across heterogeneous
databases and of developing free software based
on standards (Z39.50, XML). The Emerge
software includes:

• the Gazelle gateway, which adds Z39.50 to a 
database (using Z39.50 is recommended but not re- 
quired) 

• the Gazebo search gateway, which manages searches 
across multiple heterogeneous databases 

• a Java client toolkit, which communicates with 
the Gazebo gateway, creating queries and present- 
ing results. 

This infrastructure is being developed for several 
applications, including engineering literature and
medical research databases, as well as astronomy. 



Profiles 

In order to build a distributed search infrastructure 
which serves the scientific community, we see the
following requirements:

• Profiles which are extensible to particular do- 
mains 

• Protocols for remote access to data collections 

• Query syntax and semantics for searching data 
collections 

• File formats for data sets or subsets (e.g., XML 
document types) 

• Flexible record formats for metadata describ- 
ing data items (e.g., XML document types) 

In order to search across such a diversity of sources 
with a single query, there must be some sort of 
"common denominator", a common set of search 
terms with shared semantics. The Dublin Core and 
related W3C efforts provide this kind of standard 
for many kinds of "document-like objects". [23, 28]



However, scientific data require metadata consid- 
erably beyond the Dublin Core. For example, in 
addition to the Dublin Core categories, the ADIL 
supports searches by: 

• Sky position (e.g., in galactic coordinates) 

• Astronomical Object (e.g., M31) 

• Type of object (e.g., galaxy) 

• Wavelength 
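
One way to picture a profile that extends the Dublin Core is as a set of query terms, with a domain extension layered on top of the common base. The sketch below assumes this simple representation. The astronomy field names and units are hypothetical labels modeled on the ADIL search fields listed above, not an official profile.

```python
# The 15 elements of the Dublin Core element set.
DUBLIN_CORE = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Hypothetical astronomy extension modeled on the ADIL search fields;
# the field names and units are illustrative, not an official profile.
ASTRONOMY_EXTENSION = {
    "sky_position": "coordinate pair, e.g. galactic (l, b) in degrees",
    "object_name": "astronomical object identifier, e.g. M31",
    "object_type": "class of object, e.g. galaxy",
    "wavelength": "observing wavelength, in meters",
}


def build_profile(base, extension):
    """A profile here is just the set of query terms a service understands."""
    return set(base) | set(extension)


def unsupported_terms(query, profile):
    """Return the query terms that a given profile cannot interpret."""
    return sorted(term for term in query if term not in profile)


generic_profile = build_profile(DUBLIN_CORE, {})
astronomy_profile = build_profile(DUBLIN_CORE, ASTRONOMY_EXTENSION)

# Hypothetical query mixing a Dublin Core term with astronomy terms.
query = {"creator": "Smith", "object_name": "M31", "wavelength": 0.21}
print(unsupported_terms(query, generic_profile))    # ['object_name', 'wavelength']
print(unsupported_terms(query, astronomy_profile))  # []
```

A service that only knows the generic profile can still answer the Dublin Core part of the query, which is the "common denominator" role described above.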



Still other types of metadata are needed for archives
of planetary data, such as orbital positions and descriptions
of atmospheres and clouds. While each
discipline (and maybe even each project and instrument)
may require some unique metadata, we believe there
is enough common ground to establish standard
profiles for broad classes of scientific data, just as
the Dublin Core has done for documents. We have
proposed such metadata for astronomy in earlier work.

There must also be standards for the format of 
the results of queries for data: a structured record
that describes the kinds of objects returned as data, 
such as images, tables, and datasets in various for- 
mats such as FITS. USMARC records have served 
this role for many years for bibliographic material, 
but new, more flexible record formats are needed to 
support scientific data. The W3C RDF and schema 
initiative [29] provides a sound framework for expressing
such records, but it is critical for communities
to work to establish the appropriate standards.

AML: Astronomical Markup Language, a metadata stan- 
dard for astronomy 

The Astronomical Markup Language (AML) addresses
the need for standardized metadata for
astronomy data. AML is an XML language describing
various kinds of data useful in astronomy,
and is intended as an exchange format for astronomical
data, and especially metadata, over the
Internet. AML is both a proposed profile standard
and a prototype implementation.

Results of a search can be formatted as an AML 
document, that is, as an XML document containing 
a description of the resource using the AML DTD. 
The AML document can be processed by a program 
or presented by a browser. Guillaume has created 
a Java applet to browse AML documents as easily 
as one would browse HTML documents, but with 
some additional features specific for astronomical 
data. For example, the AML applet displays astronomical
coordinates, and displays measurements
with the relevant units and uncertainties.

The use of AML is an improvement for both the
information providers and the users (who are astronomers).
For the information providers, XML
separates the data from the user interface, so that
different data can be used with different user interfaces
without any difficulty. A small institute could
also focus on the information, and let other institutes
provide user interfaces. For the users, the
AML browser provides a uniform and unified
way to access various data coming from different
servers. Finally, users wanting to get and process
the data automatically can use the AML documents
directly, as the AML browser applet does.
XML is much more useful for this purpose than
HTML, because HTML documents contain a mix
of information about both the user interface and
the data.

The AML language is organized as seven types of 
objects. An AML document is a collection of AML 
objects, describing different types of information. 
AML objects may contain links to other AML ob- 
jects, and to external objects such as data, images, 
or documents. The AML objects are summarized 
in Table 1. The AML language can be easily ex- 
tended, for example by adding a "Set of images" 
object. 
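
As a rough illustration of how a program might consume such a record, the sketch below parses a small XML document and extracts its title, objects, and outgoing links. The element and attribute names (`AML`, `METADATA`, `IMAGE`, `OBJECT`, `LINK`) and the URLs are invented for this example and are not taken from the actual AML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the element and attribute names below are invented
# for illustration and are NOT taken from the actual AML DTD.
RECORD = """
<AML>
  <METADATA><TITLE>Radio map of M31</TITLE></METADATA>
  <IMAGE>
    <OBJECT name="M31" type="galaxy"/>
    <LINK role="data" href="http://example.org/m31.fits"/>
    <LINK role="article" href="http://example.org/m31-paper.html"/>
  </IMAGE>
</AML>
"""


def summarize(xml_text):
    """Extract the title, object names, and outgoing links from one record."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("METADATA/TITLE"),
        "objects": [o.get("name") for o in root.iter("OBJECT")],
        "links": {l.get("role"): l.get("href") for l in root.iter("LINK")},
    }


summary = summarize(RECORD)
print(summary["title"])
print(summary["links"]["data"])
```

Because the record carries no presentation markup, the same extraction code would work for any service that emits records in the same document type.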

AML records are designed to allow programs to 
automatically process and analyze the metadata. 
Guillaume has demonstrated techniques for auto- 
matically clustering astronomical information sources, 
e.g., applying a graph partitioning algorithm to the 
keywords and links in AML records. One outstanding
feature of this work was that information
from diverse sources was successfully correlated, 
because the AML records are standardized. It is 
easy to imagine how this work could be extended 
to support filtering for specific users and selective 
dissemination of information. 
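
The idea of grouping standardized records by their keywords can be sketched with a much simpler technique than the graph partitioning mentioned above: treating records as graph nodes, joining any two that share a keyword, and reporting the connected components. This stand-in is our own simplification for illustration and is not the published clustering method. The record identifiers and keywords are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: identifiers and keywords are invented for illustration.
records = {
    "img-1": {"M31", "galaxy", "radio"},
    "img-2": {"M31", "optical"},
    "art-1": {"radio", "interferometry"},
    "tab-1": {"comet", "sulfur"},
    "art-2": {"comet", "spectroscopy"},
}


def cluster_by_shared_keywords(recs):
    """Group records that are connected through at least one shared keyword."""
    parent = {r: r for r in recs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_keyword = defaultdict(list)
    for rec, keywords in recs.items():
        for kw in keywords:
            by_keyword[kw].append(rec)

    # Merge all records that share a keyword into one component.
    for members in by_keyword.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    groups = defaultdict(set)
    for rec in recs:
        groups[find(rec)].add(rec)
    return sorted(sorted(g) for g in groups.values())


clusters = cluster_by_shared_keywords(records)
print(clusters)
```

Connected components merge everything that is linked even weakly, so a real system would need a partitioning method that can split large, loosely connected groups.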

INTERACTION WITH THE DATA 

The search system described above locates informa- 
tion sources, and returns AML records to describe 
them. From the AML records, the user identifies 
data that appears to be of interest. There may in 
fact be a large number of large datasets, so it is 
important for the user to select subsets and sub- 
samples from the data. For instance, it may be the 
case that only one region or time period is required, 
or only certain measurements are relevant. 

Sometimes the metadata itself is not sufficient to
make the selection, in which case the user needs to
browse the data itself. A low resolution "thumbnail"
image may be viewed, and the user may ask to pan
and zoom around the image, to examine in detail
areas of possible interest. Regions of an image may
be selected with a bounding box; data from tables
may be examined, and particular columns
(fields) and rows selected. It may be useful to make
simple histograms or other plots, to identify characteristics
of the data, and it may be useful to manipulate
the color tables or other aspects of the
display to highlight features of the imagery. The
dataset may contain tables of data, or other multidimensional
data structures, all of which need to
be efficiently navigated in a similar fashion.

Table 1: The objects defined by the Astronomical Markup Language

Object                    Definition
------------------------  ------------------------------------------
Metadata                  An AML document is usually composed of
                          the metadata part and one of the other
                          parts.

Astronomical object       Information about an astronomical object:
                          the identifiers (the names), the
                          coordinates, the object type, other
                          information, and a list of measurements.

Article                   Information about an article, including
                          links to the article, if available.

Table                     Metadata for a table, and a link to the
                          content of the table.

Set of tables             A list of tables linked together, with
(catalogue)               information about the set.

Image                     Metadata about an image and the way it is
                          stored, and a link to the raw data of the
                          image.

Person                    Information about a person, usually an
                          author of astronomical articles.

When the precise data of interest is identified, the 
user then requests subsets and subsamples to be 
downloaded for detailed analysis. At this point, the 
data will be input to data analysis programs or sim- 
ulations. These may range from simple graphing 
and spreadsheets, up to complex, multi-supercomputer 
environments. In any case, the results may ulti- 
mately be published, adding new documents and 
datasets to the library. 
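
The kinds of selection discussed in this section (a bounding box on an image, a coarse subsample, and a choice of table columns and rows) can be sketched with plain Python lists. The function names, the small synthetic image, and the table contents are all hypothetical. A real service would operate on FITS arrays rather than nested lists.

```python
def cutout(image, x0, y0, x1, y1):
    """Return the pixels inside a bounding box (x1 and y1 are exclusive)."""
    return [row[x0:x1] for row in image[y0:y1]]


def subsample(image, step):
    """Keep every `step`-th pixel in both directions (a crude thumbnail)."""
    return [row[::step] for row in image[::step]]


def select(table, columns, predicate):
    """Keep only the chosen columns of the rows that satisfy `predicate`."""
    return [{c: row[c] for c in columns} for row in table if predicate(row)]


# Hypothetical 6 x 6 image whose pixel value encodes its position (10*y + x).
image = [[10 * y + x for x in range(6)] for y in range(6)]

# Hypothetical catalogue table; column names and values are invented.
table = [
    {"name": "A", "flux": 1.5, "year": 1997},
    {"name": "B", "flux": 0.2, "year": 1998},
    {"name": "C", "flux": 3.1, "year": 1998},
]

print(cutout(image, 1, 2, 3, 4))  # [[21, 22], [31, 32]]
print(subsample(image, 3))        # [[0, 3], [30, 33]]
print(select(table, ["name", "flux"], lambda r: r["year"] == 1998))
```

Performing these operations at the data provider means that only the selected region or rows need to cross the network.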

In the case of the ADIL, the FITS data might be
filtered, combined with data and models, and visualized.
This might be done using AIPS++ or
a similar package. [22] The results of the analysis
would be saved as one or more FITS images, which
might be entered in the ADIL when the study is
published in a journal. [25]



Because of the elaborateness of scientific data, even 
the retrieval step itself can sometimes involve fairly 
complex calculations above and beyond the boolean 
matching typical of bibliographic data; e.g., apply- 
ing a pattern recognition algorithm to a database 
of images. Furthermore, data may need to be pro- 
cessed and formatted even before it's browsed. And 
finally, the scientific investigation may involve anal- 
ysis of multiple data sets to produce a composite 
data product distilled from diverse data from sev- 
eral sources. Today, these types of activities are 
carried out routinely using a heterogeneous, ad- 
hoc assortment of applications, typically special- 
ized applications requiring access to data on local 
disks. In the future, this will increasingly be done
using "workbenches" (i.e., specialized Web portals)
and ubiquitous computational GRIDs.

A PROTOTYPE IMPLEMENTATION 

Over the past few years NCSA Project 30 has been 
constructing a prototype which provides a sophis- 
ticated "conversation with the data" for astronomy 
data. Our prototype uses NCSA Emerge and AML 
as the basis to build a system to locate, browse, 
and retrieve astronomy data from the NCSA As- 
tronomy Digital Image Library and other data ser- 
vices. 

The data sources are already available through standard
Web interfaces which return HTML. We have
added the ability to query them using Z39.50, installing
the Gazelle Z39.50 gateway on the data server if
needed.

The Gazebo GUI implements a query construction 
interface, which presents one or more profiles, i.e., 
standard sets of query terms and meanings. The 
client configuration is loaded from the Gazebo gate- 
way, so the same client can have many "views" of 
the information space. The current prototype im- 
plements both a "simple" query interface (a single 
list of keywords), and an "advanced" interface (a 
graphical interface to construct a boolean expres- 
sion). The prototype supports a general purpose 
profile for bibliographic searching, and a special- 
ized astronomy profile. 

The results of the query are returned as AML records, 
as well as HTML. Creating AML (XML) records is 
usually a straightforward extension of the existing 
code that generates HTML. 

The Gazebo GUI sends queries, encoded in the 
XML-based Gazebo abstract search language, to 
the Gazebo gateway to be executed on a set of tar- 
get data sources. Gazebo translates each query into 
the native query syntax of each target data source, 
and executes it remotely using the native search 
protocol of each target data source. This behavior 
is highly configurable. Requesting result records 
is handled similarly; Gazebo translates the GUI's 
requests for results into the target data sources' 
protocols. 
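
The translation step described above can be pictured as rendering one abstract query tree into several target syntaxes. The sketch below is a hedged illustration only: the tuple-based query representation, the field mappings, and the numeric attribute codes are hypothetical and do not reproduce the Gazebo abstract search language or any registered Z39.50 attribute set.

```python
# Abstract query: ("term", field, value) leaves joined by
# ("and" | "or", left, right) nodes.
QUERY = ("and",
         ("term", "object_type", "galaxy"),
         ("or",
          ("term", "title", "M31"),
          ("term", "title", "Andromeda")))

# Hypothetical field mappings for two targets. The numeric codes only
# imitate the style of Z39.50 use attributes and are not a real profile.
PREFIX_ATTRS = {"title": 4, "object_type": 9001}
SQL_COLUMNS = {"title": "img_title", "object_type": "obj_class"}


def to_prefix(q):
    """Render the query in a prefix notation of the kind used by Z39.50 tools."""
    if q[0] == "term":
        _, field, value = q
        return f'@attr 1={PREFIX_ATTRS[field]} "{value}"'
    op, left, right = q
    return f"@{op} {to_prefix(left)} {to_prefix(right)}"


def to_sql(q, params):
    """Render the query as a parameterized SQL WHERE clause."""
    if q[0] == "term":
        _, field, value = q
        params.append(value)
        return f"{SQL_COLUMNS[field]} = ?"
    op, left, right = q
    return f"({to_sql(left, params)} {op.upper()} {to_sql(right, params)})"


params = []
print(to_prefix(QUERY))
print(to_sql(QUERY, params), params)
```

Keeping the query abstract until the last step is what lets one user request reach data sources that speak different native protocols.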

The result records returned by typical data sources 
are more or less structured data. Gazebo can re- 
turn them unmodified or it can process them through 
external CGI scripts which can translate them from 
arbitrary file formats into any MIME type. This is 
useful, for instance, for providing HTML views of 
records in non-text formats. The records are passed 
from the gateway to the GUI, which displays them 
appropriately. 

The GUI displays the number of records returned 
by each server, and retrieves and shows the short 
records as requested. When a full record is re- 
quested, the GUI retrieves the record and launches 
an appropriate applet to display it. If the record is 
HTML, it is displayed with the ICE HTML viewer.
[10] When the record is AML, the AML browser is
invoked to display it. 



The AML record may contain pointers to abstracts 
and/or datasets. The user may follow these links to 
view the actual data. The abstract will be viewed 
with the HTML viewer. When a FITS dataset 
is selected, the Horizon Image Browser will be
launched to browse the data, and download it if 
desired. 
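
The choice of viewer by record type can be sketched as a dispatch table keyed on MIME type. The handler functions below are placeholders that merely report what they would display, and the MIME type used for AML records is a hypothetical label, since the text above does not specify one.

```python
# Viewer roles follow the prototype description; the handler functions and
# the MIME type for AML are assumptions made for this illustration.
def show_html(record):
    return f"HTML viewer: {len(record)} bytes"


def show_aml(record):
    return f"AML browser: {len(record)} bytes"


def show_fits(record):
    return f"image browser: {len(record)} bytes"


VIEWERS = {
    "text/html": show_html,
    "text/x-aml+xml": show_aml,   # hypothetical type for AML records
    "image/fits": show_fits,
}


def display(mime_type, record):
    """Pick a viewer by MIME type, offering a download when none is known."""
    viewer = VIEWERS.get(mime_type)
    if viewer is None:
        return f"no viewer for {mime_type}; offer download"
    return viewer(record)


print(display("text/html", b"<html></html>"))
print(display("image/fits", b"SIMPLE  =                    T"))
print(display("application/x-hdf", b"\x89HDF"))
```

Adding support for a new data format then amounts to registering one more entry in the table, which matches the pluggable design described in this section.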

CONCLUSION AND RELATION TO OTHER WORK 

We have constructed a complete environment for 
locating astronomy information, for examining and 
browsing metadata, and for browsing and accessing 
both the text and the data. Our system is unique in 
that we support both text and data, using a gen- 
eral, standards based protocol. We have defined 
new protocols for describing astronomy data, and 
created a much more complex "conversation" than 
most systems can support. The flexible configuration
and interoperable standards we use make it
comparatively easy to add databases. 

It is important to reiterate that the Emerge software
is extremely flexible, and is used for several
application communities. The astronomy specific
features are replaceable modules, so the system can
be customized for different user communities.

The Gazebo gateway superficially resembles many 
conventional Web gateways and portals. However, 
we use Z39.50 to distribute the queries and AML 
to return the results. These standards assure much 
greater interoperability than Web CGI and HTML. 

Z39.50 has been widely used by libraries for many
years, and there are many efforts to federate Z39.50
services, such as the CIC Virtual Electronic Library.
There is also a well established effort to
standardize metadata for bibliographic resources,
e.g., the Dublin Core. [23] Our work is important
because it shows that Z39.50 can be used with scientific
data. Our protocol development and the
AML extend the principles of the Dublin Core to a
significant body of scientific data.

The AML uses standard XML, but is not directly
related to the still evolving W3C metadata efforts.
[28, 29] As RDF standards become established, AML
can presumably be aligned with them. For instance,
the XML RDF schema [29] and the XML-Data proposal
[31] are likely to be important, and the AML
will follow these standards as they become established.

The Gazebo gateway and GUI implement a query
protocol using XML. The W3C is currently in the
early stages of defining a standard for representing
queries in XML. [27] When this standard matures,
Gazebo will support it appropriately.